Abstract: During the life cycle of an XML application, both schemas and
queries may change from one version to another. Schema evolutions may affect
query results and potentially the validity of produced data. Nowadays, a
challenge is to assess and accommodate the impact of these changes in rapidly
evolving XML applications.

This article proposes a logical framework and tool for verifying forward/backward 
compatibility issues involving schemas and queries. First, it allows analyzing 
relations between schemas. Second, it allows XML designers to identify queries 
that must be reformulated in order to produce the expected results across suc- 
cessive schema versions. Third, it allows examining more precisely the impact 
of schema changes over queries, therefore facilitating their reformulation. 
Ensuring Query Compatibility with Evolving XML Schemas



Keywords: XML, Schema, Queries, XPath, Evolution, Compatibility, Analysis
1 Introduction

XML is now commonplace on the web and in many information systems where it 
is used for representing all kinds of information resources, ranging from simple 
text documents such as RSS or Atom feeds to highly structured databases. 
In these dynamic environments, not only data are changing steadily but their 
schemas also get modified to cope with the evolution of the real world entities 
they describe. 

Schema changes raise the issue of data consistency. Existing documents and 
data that were valid with a certain version of a schema may become invalid 
on a new version of the schema (forward incompatibility). Conversely, new 
documents created with the latest version of a schema may be invalid on some 
previous versions (backward incompatibility). 

In addition, schemas may be written in different languages, such as DTD, 
XML Schema, or Relax-NG, to name only the most popular ones. And it is 
common practice to describe the same structure, or new versions of a structure, 
in different schema languages. Document formats developed by W3C provide 
a variety of examples: XHTML 1.0 has both DTDs and XML Schemas, while 
XHTML 2.0 has a Relax-NG definition; the schema for SVG Tiny 1.1 is a DTD, 
while version 1.2 is written in Relax-NG; MathML 1.01 has a DTD, MathML 
2.0 has both a DTD and an XML Schema, and MathML 3.0 is developed with 
a Relax-NG schema and is expected to have also a DTD and an XML Schema. 
An issue then is to make sure that schemas written in different languages are 
equivalent, i.e. they describe the same structure, possibly with some differences 
due to the expressivity of the language [14] . Another issue is to clearly identify 
the differences between two versions of the same schema expressed in differ- 
ent languages. Moreover, the issues of forward and backward compatibility of 
instances obviously remain when schema languages change from a version to 
another. 

Validation, and then compatibility, is not the only purpose of a schema. 
Validation is usually the first step for safe processing of documents and data. It 
makes sure that documents and data are structured as expected and can then 
be processed safely. The next step is to actually access and select the various 
parts to be handled in each phase of an application. For this, query languages
play a key role. As an example, when transforming a document with XSL, 
XPath queries are paramount to locate in the original document the data to be 
produced in the transformed document. 

Queries are affected by schema evolutions. The structures they return may 
change depending on the version of the schema used by a document. When 
changing schema, a query may return nothing, or something different from what 
was expected, and obviously further processing based on this query is at risk. 

These observations highlight the need for evaluating precisely and safely 
the impact of schema evolutions on existing and future instances of documents 
and data. They also show that it is important for software engineers to pre- 
cisely know what parts of a processing chain have to be updated when schemas 
change. In this paper we focus on the XPath query language which is used in 
many situations while processing XML documents and data. The XSL trans- 
formation language was already mentioned, but XPath is also present in XLink 
and XQuery for instance. 



RR n° 6711 






Geneves, Layaida, & Quint 



Related Work 

Schema evolution is an important topic and has been extensively explored in 
the context of relational, object-oriented, and XML databases. Most of the 
previous work for XML query reformulation is approached through reductions 
to relational problems 0] . This is because schema evolution was considered as a 
storage problem where the priority consists in ensuring data consistency across 
multiple relational schema versions. In such settings, two distinct schemas and 
an explicit description of the mapping between them are assumed as input. The 
problem then consists in reformulating a query expressed in terms of one schema 
into a semantically equivalent query in terms of the other schema: see jSJ |^ 
and more recently [12] with references thereof. 

In addition to the fundamental differences between XML and the relational 
data model, in the more general case of XML processing, schemas constantly 
evolve in a distributed, independent, and unpredictable environment. The re- 
lations between different schemas are not only unknown but hard to track. In 
this context, one priority is to help maintaining query consistency during these 
evolutions, which is still considered as a challenging problem [16]. 

The work found in [13] discusses the impact of evolving XML schemas on
query reformulation. Based on a taxonomy of XML schema changes during their 
evolution, the authors provide informal - not exact nor systematic - guidelines 
for writing queries which are less sensitive to schema evolution. In fact, study- 
ing query reformulation requires at least the ability to analyze the relationship 
between queries. For this reason, a closely related work is the problem of deter- 
mining query containment and satisfiability under type constraints [TJini- The 
work found in [Ij studies the complexity of XPath emptiness and containment 
for various fragments (see [2] and references thereof for a survey) . 

The main distinctive idea pursued in this paper is to develop a logical ap- 
proach for guiding schema and query evolution. In contrast to the classical use 
of logics for proving properties such as query emptiness or equivalence [TJ [S] , the 
goal here is different in that we seek to provide the necessary tools to produce 
relevant knowledge when such relations do not hold. 

Outline 

The rest of this paper is organized as follows: the next section introduces our
framework. Section 3 presents its underlying logic, and Section 4 presents
predicates for characterizing the impact of schema changes. We report on experiments
on realistic scenarios in Section 5 before we conclude in Section 6.

2 Analysis Framework 

Our framework allows the automatic verification of properties related to XML 
schema and query evolution. In particular, it offers the possibility of checking 
fine-grained properties on the behavior of queries with respect to successive ver- 
sions of a given schema. The system can be used for checking whether schema 
evolutions require a particular query to be updated. Whenever schema evolu- 
tions may induce query malfunctions, the system is able to generate annotated 
XML documents that exemplify bugs, with the goal of helping the programmer 
to understand and properly overcome undesired effects of schema evolutions. 
[Figure 1 pipeline: an XML problem description (text file), for example
select("a//b[ancestor::e]", type("XHTML1-strict.dtd", "html")), goes through
parsing and compilation to yield a logical formula over binary trees with
attributes, which is submitted to a satisfiability test. If the formula is
unsatisfiable, the property is proved. If it is satisfiable, a satisfying
binary tree with attributes is synthesized and converted into a sample XML
document inducing a bug.]

Figure 1: Framework Overview.



For these purposes, our framework relies on the combination and joint use 
of several contributions: 

• an extension of the logic introduced in [9] to deal with XML attributes
(Sections 2 and 3);

• a set of logical features and high-level predicates specifically designed for
studying and characterizing schema and query compatibility issues when
schemas evolve (Section 4);

• a range of applications and procedures to cope with schema and query
evolution (Section 5);

• a full implementation of the whole system, including:

— a parser for reading the problem description (text file), which in turn
uses specific parsers for schemas (Section 2.2), queries (Section 2.3),
logical formulas (Section 3.2), and predicates (Section 4);

— compilers for translating schemas and queries into their logical
representations (Sections 3.3 and 3.4);



— an optimized solver first described in [5J [TU] for checking satisfiability 

of logical formulas in time $2^{O(n)}$, where n is the formula size;

— and a counter example XML tree generator (described in [TIT). 

Figure 1 illustrates how the previous software components are combined and
used together, in a simplified overview of the global framework. We next
introduce the data model we consider for XML documents, schemas and queries.

2.1 XML Trees with Attributes 

An XML document is considered as a finite tree of unbounded depth and arity, 
with two kinds of nodes respectively named elements and attributes. In such a 
tree, an element may have any number of children elements, and may carry zero, 
one or more attributes. Attributes are leaves. Elements are ordered whereas 
attributes are not, as illustrated in Figure 4. In this paper, we focus on the
nested structure of elements and attributes, and ignore XML data values. 
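
The data model above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `Element`, the helper `depth`, and the sample documents are hypothetical and are not part of the system described in this paper. Attributes are kept in an unordered set and children in an ordered tuple, so two trees that differ only in attribute order are equal, whereas swapping two children yields a different tree.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    """An element node: a label, an unordered set of attribute names,
    and an ordered sequence of child elements (data values are ignored)."""
    name: str
    attributes: frozenset = frozenset()
    children: tuple = ()


def depth(node: Element) -> int:
    """Depth of the tree rooted at node (a single element has depth 1)."""
    return 1 + max((depth(child) for child in node.children), default=0)


# <a x="" y=""><b/><c z=""/></a>
doc = Element("a", frozenset({"x", "y"}),
              (Element("b"), Element("c", frozenset({"z"}))))

# Same document with attributes listed in another order: still equal.
same_attrs = Element("a", frozenset({"y", "x"}),
                     (Element("b"), Element("c", frozenset({"z"}))))

# Same document with the two children swapped: a different tree.
swapped = Element("a", frozenset({"x", "y"}),
                  (Element("c", frozenset({"z"})), Element("b")))
```

Here `doc == same_attrs` holds while `doc == swapped` does not, which mirrors the distinction between unordered attributes and ordered elements.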

2.2 Type Constraints 

As an internal representation for tree grammars, we consider regular tree type 
expressions (in the manner of extended with constraints over attributes. 
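Assuming a set of variables ranged over by x, we define a tree type expression
as follows:

$$
\begin{array}{rcll}
\tau & ::= & & \text{tree type expression} \\
 & & \emptyset & \text{empty set} \\
 & \mid & () & \text{empty sequence} \\
 & \mid & \tau \mid \tau & \text{disjunction} \\
 & \mid & \tau, \tau & \text{concatenation} \\
 & \mid & l(a)[\tau] & \text{element definition} \\
 & \mid & x & \text{variable} \\
 & \mid & \text{let } x.\tau \text{ in } \tau & \text{binder}
\end{array}
$$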

We impose a usual restriction on the recursive use of variables: we allow
unguarded (i.e. not enclosed by a label) recursive uses of variables, but restrict
them to tail positions^1. With that restriction, tree type expressions define
regular tree languages. In addition, an element definition may involve simple
attribute expressions that describe which attributes the defined element may
(or may not) carry:

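$$
\begin{array}{rcll}
a & ::= & & \text{attribute expression} \\
 & & () & \text{empty list} \\
 & \mid & \mathit{list} \mid a & \text{disjunction} \\
\mathit{list} & ::= & & \text{attribute list} \\
 & & \mathit{list}, \mathit{list} & \text{commutative concatenation} \\
 & \mid & l? & \text{optional attribute} \\
 & \mid & l & \text{required attribute} \\
 & \mid & \neg l & \text{prohibited attribute}
\end{array}
$$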

Our tree type expressions capture most of the schcmas in use today [HI [3]. 
In practice, our system provides parsers that convert DTDs, XML Schemas, 
and Relax NGs to this internal tree type representation. Users may thus define 
constraints over XML documents with the language of their choice, and, more 
importantly, they may refer to most existing schemas for use with the system. 
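
To make the meaning of attribute expressions concrete, here is a minimal Python sketch that checks whether a set of attribute names is accepted by an attribute expression. It is an illustration, not the system's implementation: the representation (a list of alternatives, each a dictionary from attribute name to kind) and the function names are hypothetical. It also assumes the closed-world convention stated in Section 3.4, namely that an attribute not mentioned in a list is not allowed.

```python
OPTIONAL, REQUIRED, PROHIBITED = "optional", "required", "prohibited"


def list_allows(attr_list, attrs):
    """Check one attribute list (dict: name -> kind) against a set of names."""
    for name, kind in attr_list.items():
        if kind == REQUIRED and name not in attrs:
            return False
        if kind == PROHIBITED and name in attrs:
            return False
    # Assumption (closed world): attributes not mentioned in the list are rejected.
    return all(name in attr_list for name in attrs)


def expr_allows(alternatives, attrs):
    """An attribute expression is a disjunction of attribute lists;
    the empty list () is represented by an empty dictionary."""
    return any(list_allows(alt, attrs) for alt in alternatives)


# (id, class?, ¬style) | ()
expression = [
    {"id": REQUIRED, "class": OPTIONAL, "style": PROHIBITED},
    {},
]
```

With this expression, the attribute sets `{"id"}`, `{"id", "class"}` and the empty set are accepted, whereas `{"class"}`, `{"id", "style"}` and `{"id", "lang"}` are rejected.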



2.3 Queries 

The set of XPath expressions we consider is given by the syntax shown in
Figure 2. The semantics of XPath expressions is described in [5], and more
formally in [17]. We observed that, in practice, many XPath expressions contain
syntactic sugars that can also fit into this fragment. Figure 3 presents how our
XPath parser rewrites some commonly found XPath patterns into the fragment
of Figure 2, where the notation (axis::nt)^k stands for the composition of k
successive path steps of the same form: axis::nt/.../axis::nt (k steps).



3 Logical Setting 
3.1 Logical Data Model 

It is well-known that there exist bijective encodings between unranked trees 
(trees of unbounded arity) and binary trees. Owing to these encodings binary 

^1 For instance, "let x.l(a)[τ], x | () in x" is allowed.
query     ::= /path                        absolute path
            | path                         relative path
            | query | query                union
            | query ∩ query                intersection
path      ::= path/path                    path composition
            | path[qualifier]              qualified path
            | axis::nt                     step
qualifier ::= qualifier and qualifier      conjunction
            | qualifier or qualifier       disjunction
            | not(qualifier)               negation
            | path                         path
            | path/@nt                     attribute path
            | @nt                          attribute step
nt        ::=                              node test
              label                        node label
            | *                            any node label
axis      ::=                              tree navigation axis
              self | child | parent
            | descendant | ancestor
            | descendant-or-self
            | ancestor-or-self
            | following-sibling
            | preceding-sibling
            | following | preceding

Figure 2: XPath Expressions.



nt[position() = 1]               ⇝  nt[not(preceding-sibling::nt)]
nt[position() = last()]          ⇝  nt[not(following-sibling::nt)]
nt[position() = k]  (k > 1)      ⇝  nt[(preceding-sibling::nt)^(k-1)]
count(path) = 0                  ⇝  not(path)
count(path) > 0                  ⇝  path
count(nt) > k  (k > 0)           ⇝  nt/(following-sibling::nt)^k
preceding-sibling::*[position() = last() and qualifier]
                                 ⇝  preceding-sibling::*[not(preceding-sibling::*) and qualifier]

Figure 3: Syntactic Sugars and their Rewritings.






trees may be used instead of unranked trees without loss of generality. In the 
sequel, we rely on a simple "first-child & next-sibling" encoding of unranked 
trees. In this encoding, the first child of an element node is preserved in the 
binary tree representation, whereas siblings of this node are appended as right 
successors in the binary representation. Attributes are left unchanged by this 
encoding. For instance, Figure 5 presents how the sample tree of Figure 4 is
mapped. 
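
The "first-child & next-sibling" encoding described above can be sketched as follows. This is a minimal illustration, not the system's code: the class names `Unranked` and `Binary` and the helper functions are hypothetical. The encoding is shown together with its inverse, which illustrates why no information is lost.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Unranked:
    """A node of an unranked XML tree (ordered children, unordered attributes)."""
    name: str
    attributes: frozenset = frozenset()
    children: tuple = ()


@dataclass
class Binary:
    """A node of the binary encoding; attributes are carried over unchanged."""
    name: str
    attributes: frozenset = frozenset()
    first_child: Optional["Binary"] = None
    next_sibling: Optional["Binary"] = None


def encode_forest(nodes):
    """Encode an ordered sequence of sibling trees as one binary tree."""
    result = None
    # Build from the last sibling backwards so that each node points to the next one.
    for node in reversed(nodes):
        result = Binary(node.name, node.attributes,
                        encode_forest(node.children), result)
    return result


def encode(root):
    """Encode a single unranked tree."""
    return encode_forest((root,))


def decode_forest(binary):
    """Inverse of encode_forest: rebuild the list of sibling trees."""
    forest = []
    while binary is not None:
        forest.append(Unranked(binary.name, binary.attributes,
                               tuple(decode_forest(binary.first_child))))
        binary = binary.next_sibling
    return forest


# <a x=""><b/><c/><d/></a>
tree = Unranked("a", frozenset({"x"}),
                (Unranked("b"), Unranked("c"), Unranked("d")))
encoded = encode(tree)
```

In `encoded`, the node `a` has `b` as first child, and `c` and `d` are reached from `b` by following `next_sibling` links; decoding gives back the original tree.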







Figure 4: Sample XML Tree with Attributes. 




Figure 5: Binary Encoding of Tree of Figure 4.

The logic we introduce below, used as the core of our framework, operates 
on such binary trees with attributes. 

3.2 Logical Formulas 

The concrete syntax of logical formulas is shown in Figure 6, where the
meta-syntax (X)⊕ means one or more occurrences of X separated by commas. The
reader can directly use this syntax for encoding formulas as text files to be used
with the system described in Section 2 [8]. This concrete syntax is used as a
single unifying notation throughout the paper.

The semantics of logical formulas corresponds to the classical semantics of
a μ-calculus interpreted over finite tree structures. A formula is satisfiable iff
there exists a finite binary tree with attributes for which the formula holds at
some node. This is formally defined in [3], and we review it informally below
through a series of examples.
φ ::=                          formula
      T                        true
    | F                        false
    | l                        element name
    | P                        atomic proposition
    | #                        start context
    | φ | φ                    disjunction
    | φ & φ                    conjunction
    | φ => φ                   implication
    | φ <=> φ                  equivalence
    | (φ)                      parenthesized formula
    | ~φ                       negation
    | <p>φ                     existential modality
    | <l>T                     attribute named l
    | $X                       variable
    | let ($X = φ)⊕ in φ       binder for recursion
    | predicate                predicate (see Section 4)
p ::=                          program inside modalities
      1                        first child
    | 2                        next sibling
    | -1                       parent
    | -2                       previous sibling

Figure 6: Syntax of Logical Formulas.



There is a difference between an element name and an atomic proposition^2:
an element has one and only one element name, whereas it can satisfy multiple 
atomic propositions. We use atomic propositions to attach specific information 
to tree nodes, not related to their XML labeling. For example, the start context 
(a reserved atomic proposition) is used to mark the starting context nodes for 
evaluating XPath expressions. 

The logic uses programs for navigating in binary trees: the program 1
navigates from a node down to its first successor, and the program 2
navigates from a node down to its second successor. The logic also features
converse programs -1 and -2 for navigating upward in binary trees, respectively
from the first successor to its parent and from the second successor to its previous
sibling. Table 1 gives some simple formulas using modalities for navigating in
binary trees, together with sample satisfying trees, in binary and unranked tree
representations.
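
The navigation programs can be illustrated with a small evaluator that checks whether a formula holds at a given node of a binary tree. This is a sketch under stated assumptions, not the solver used by the system (which decides satisfiability rather than evaluating a formula on one tree): formulas are encoded as nested tuples, and the names `Node`, `set_first_child`, `set_next_sibling` and `holds` are hypothetical. The three sample formulas are those of Table 1.

```python
class Node:
    """A binary tree node; links maps a program (1, 2, -1, -2) to the node reached."""

    def __init__(self, name):
        self.name = name
        self.links = {}


def set_first_child(parent, child):
    parent.links[1] = child      # program 1: down to the first child
    child.links[-1] = parent     # program -1: back up to the parent


def set_next_sibling(node, sibling):
    node.links[2] = sibling      # program 2: to the next sibling
    sibling.links[-2] = node     # program -2: back to the previous sibling


def holds(formula, node):
    """Evaluate a recursion-free formula, given as a nested tuple, at a node."""
    op = formula[0]
    if op == "true":
        return True
    if op == "name":
        return node.name == formula[1]
    if op == "not":
        return not holds(formula[1], node)
    if op == "and":
        return holds(formula[1], node) and holds(formula[2], node)
    if op == "or":
        return holds(formula[1], node) or holds(formula[2], node)
    if op == "dia":  # existential modality <p>formula
        target = node.links.get(formula[1])
        return target is not None and holds(formula[2], target)
    raise ValueError(f"unknown operator: {op}")


# a & <1>(b & <2>c), evaluated on <a><b/><c/></a>
a, b, c = Node("a"), Node("b"), Node("c")
set_first_child(a, b)
set_next_sibling(b, c)
row2 = ("and", ("name", "a"),
        ("dia", 1, ("and", ("name", "b"), ("dia", 2, ("name", "c")))))

# e & <-1>(d & <2>g), evaluated on <d><e/></d><g/>
d, e, g = Node("d"), Node("e"), Node("g")
set_first_child(d, e)
set_next_sibling(d, g)
row3 = ("and", ("name", "e"),
        ("dia", -1, ("and", ("name", "d"), ("dia", 2, ("name", "g")))))

# f & <-2>(g & ~<2>T): reaching g through -2 implies that g has a next sibling
g2, f = Node("g"), Node("f")
set_next_sibling(g2, f)
row4 = ("and", ("name", "f"),
        ("dia", -2, ("and", ("name", "g"), ("not", ("dia", 2, ("true",))))))
```

The second formula holds at node `a`, the third holds at node `e`, and the last one fails at `f`: any node reached through -2 necessarily has a next sibling, which is why no tree satisfies it.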

The logic allows expressing recursion in trees through the recursive binder.
For example the recursive formula:

let $X = b | <2>$X in $X

means that either the current node is named b or there is a sibling of the current
node which is named b. For this purpose, the variable $X is bound to the
subformula b | <2>$X which contains an occurrence of $X (therefore defining
^2 In practice, an atomic proposition must start with a
Sample Formula          Binary Tree                                    XML
a & <1>b                a with first child b                           <a><b/></a>
a & <1>(b & <2>c)       a with first child b; b with next sibling c    <a><b/><c/></a>
e & <-1>(d & <2>g)      d with first child e and next sibling g        <d><e/></d><g/>
f & <-2>(g & ~<2>T)     none                                           none

Table 1: Sample Formulas and Satisfying Trees.



the recursion). The scope of this binding is the subformula that follows the
"in" symbol of the formula, that is $X. The entire formula can thus be seen as
a compact recursive notation for an infinitely nested formula of the form:

b | <2>(b | <2>(b | <2>(...)))

Recursion allows expressing global properties. For instance, the recursive
formula:

~ let $X = a | <1>$X | <2>$X in $X

expresses the absence of nodes named a in the whole subtree of the current node
(including the current node). Furthermore, the fixpoint operator makes it possible
to bind several variables at a time, which is specifically useful for expressing
mutual recursion. For example, the mutually recursive formula:

let
  $X = (a & <2>$Y) | <1>$X | <2>$X,
  $Y = b | <2>$Y
in $X

asserts that there is a node somewhere in the subtree such that this node is
named a and it has at least one sibling which is named b. Binding several
variables at a time provides a very expressive yet succinct notation for expressing
mutually recursive structural patterns (that are common in XML Schemas, for
instance).
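
The meaning of the recursive binder on a given finite tree can be sketched by iterating the bindings until nothing changes, which computes a least fixpoint. The code below is an illustration only: it evaluates a formula on one tree, whereas the system decides satisfiability over all trees. It assumes that variables occur positively in the bindings, and the names `Node`, `link`, `holds` and `least_fixpoint` are hypothetical. The bindings are those of the mutually recursive formula above.

```python
class Node:
    """A binary tree node; links maps a program (1, 2, -1, -2) to the node reached."""

    def __init__(self, name):
        self.name = name
        self.links = {}


def link(node, program, target):
    """Connect node to target by a program and add the converse link."""
    node.links[program] = target
    target.links[-program] = node


def holds(formula, node, env):
    """Evaluate a formula (nested tuple) at a node; env maps variables to node sets."""
    op = formula[0]
    if op == "name":
        return node.name == formula[1]
    if op == "var":
        return node in env[formula[1]]
    if op == "and":
        return holds(formula[1], node, env) and holds(formula[2], node, env)
    if op == "or":
        return holds(formula[1], node, env) or holds(formula[2], node, env)
    if op == "dia":
        target = node.links.get(formula[1])
        return target is not None and holds(formula[2], target, env)
    raise ValueError(f"unknown operator: {op}")


def least_fixpoint(bindings, nodes):
    """Start from empty sets and re-evaluate every binding until stable.
    Assumption: variables occur positively, so the sets can only grow."""
    env = {variable: set() for variable in bindings}
    changed = True
    while changed:
        changed = False
        for variable, body in bindings.items():
            satisfied = {n for n in nodes if holds(body, n, env)}
            if satisfied != env[variable]:
                env[variable] = satisfied
                changed = True
    return env


# $X = (a & <2>$Y) | <1>$X | <2>$X,  $Y = b | <2>$Y
bindings = {
    "X": ("or",
          ("or", ("and", ("name", "a"), ("dia", 2, ("var", "Y"))),
           ("dia", 1, ("var", "X"))),
          ("dia", 2, ("var", "X"))),
    "Y": ("or", ("name", "b"), ("dia", 2, ("var", "Y"))),
}

# <r><a/><c/><b/></r>: a is followed by a sibling named b
r, a, c, b = Node("r"), Node("a"), Node("c"), Node("b")
link(r, 1, a)
link(a, 2, c)
link(c, 2, b)
env = least_fixpoint(bindings, [r, a, c, b])

# <r><b/><a/></r>: no b after a
r2, b2, a2 = Node("r"), Node("b"), Node("a")
link(r2, 1, b2)
link(b2, 2, a2)
env2 = least_fixpoint(bindings, [r2, b2, a2])
```

On the first tree the formula `$X` holds at the root `r`, because `a` is followed by a sibling named `b`; on the second tree it does not, because the program 2 only looks at following siblings.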

From a theoretical perspective, the recursive binder let $X = φ in φ corresponds
to the fixpoint operators of the μ-calculus. It is shown in [9] that the
least fixpoint and the greatest fixpoint operators of the μ-calculus coincide over
finite tree structures, for a restricted class of formulas called cycle-free formulas.
Translations of XPath expressions and schemas presented in this paper always
yield cycle-free formulas (see [l^ for more details). 
3.3 Compilation of Queries

The logic is expressive enough to capture the set of XPath expressions
presented in Section 2.3. For example, Figure 7 illustrates how the sample XPath
expression:

child::r[child::w/@att]

is expressed in the logic, on the binary tree representation. From a given
context in an XML document, this expression selects all r child nodes which
have at least one w child with an attribute att. The formula holds for r nodes
which are selected by the expression. The first part of the formula, φ,
corresponds to the step child::r, which selects candidate r nodes. The second
part, ψ, navigates downward in the subtrees of these candidate nodes to verify
that they have at least one immediate w child with an attribute att.



Translated query: child::r[child::w/@att]
Translation:

r & (let $X = <-1># | <-2>$X) & <1>let $Y = w & <att>T | <2>$Y

where φ denotes the first part, r & (let $X = <-1># | <-2>$X), and ψ denotes
the second part, <1>let $Y = w & <att>T | <2>$Y.

Figure 7: XPath Translation Example.

This example illustrates the need for converse programs inside modalities. 
The translated XPath expression only uses forward axes (child and attribute), 
nevertheless both forward and backward modalities are required for its logical 
translation. Without converse programs we would have been unable to differen- 
tiate selected nodes from nodes whose existence is simply tested. More generally, 
properties must often be stated on both the ancestors and the descendants of the 
selected node. Equipping the logic with both forward and converse programs is 
therefore crucial. Logics without converse programs may only be used for solv- 
ing XPath emptiness but cannot be used for solving other decision problems 
such as containment efficiently. 

A systematic translation of XPath expressions into the logic is given in [9]. In
this paper, we extended it to deal with attributes. We implemented a compiler
that takes any expression of the fragment of Figure 2 and computes its logical
translation. With the help of this compiler, we extend the syntax of logical
formulas with a logical predicate select("query", φ). This predicate compiles
the XPath expression query given as parameter into the logic, starting from a
context that satisfies φ. The XPath expression to be given as parameter must
match the syntax of the XPath fragment shown in Figure 2 (or Figure 3). In
a similar manner, we introduce the predicate exists("query", φ), which tests
the existence of query from a context satisfying φ, in a qualifier-like manner
(without moving to its result). Additionally, the predicate select("query")
is introduced as a shortcut for select("query", #), where # simply marks the
initial context node of the XPath expression^3. The predicate exists("query") is
a shortcut for exists("query", T). These syntactic extensions of the logic allow
the user to easily embed XPath expressions and formulate decision problems out
of them (e.g. containment or any other boolean combination). In the next
sections we explain how the framework allows combining queries with schema
information for formulating problems.



3.4 Compilation of Tree Types 

Tree type expressions are compiled into the logic in two steps: the first step
translates them into binary tree type expressions, and the second step
compiles this intermediate representation into the logic. The translation
procedure from tree type expressions to binary tree type expressions is well-known
and detailed in [7]. The syntax of output expressions follows:

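$$
\begin{array}{rcll}
\tau & ::= & & \text{binary tree type expression} \\
 & & \emptyset & \text{empty set} \\
 & \mid & () & \text{empty tree} \\
 & \mid & \tau \mid \tau & \text{disjunction} \\
 & \mid & l(a)[x_1, x_2] & \text{element definition} \\
 & \mid & \text{let } x.\tau \text{ in } \tau & \text{binder}
\end{array}
$$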



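Attribute expressions are not concerned by this transformation to binary form:
they are simply attached, unchanged, to new (binary) element definitions.
Finally, binary tree type expressions are compiled into the logic. The logical
translation of an expression τ is given by the function
$\mathrm{tr}(\tau)^{\varphi}_{\psi}$ defined below:

$$
\begin{array}{rcl}
\mathrm{tr}(\tau)^{\varphi}_{\psi} & = & \mathtt{F} \quad \text{for } \tau = \emptyset, () \\
\mathrm{tr}(\tau_1 \mid \tau_2)^{\varphi}_{\psi} & = & \mathrm{tr}(\tau_1)^{\varphi}_{\psi} \mid \mathrm{tr}(\tau_2)^{\varphi}_{\psi} \\
\mathrm{tr}(l(a)[x_1, x_2])^{\varphi}_{\psi} & = & (l \;\&\; \varphi \;\&\; \mathrm{tra}(a) \;\&\; s_1(x_1) \;\&\; s_2(x_2)) \mid \psi \\
\mathrm{tr}(\text{let } x_i.\tau_i \text{ in } \tau)^{\varphi}_{\psi} & = & \text{let } \$X_i = \mathrm{tr}(\tau_i)^{\varphi}_{\psi} \text{ in } \mathrm{tr}(\tau)^{\varphi}_{\psi}
\end{array}
$$

where the function $s_p(\cdot)$ sets the type frontier:

$$
s_p(x) = \begin{cases}
{\sim}\langle p \rangle \mathtt{T} & \text{if } x \text{ is bound to } () \\
{\sim}\langle p \rangle \mathtt{T} \mid \langle p \rangle \$X & \text{if } \mathrm{nullable}(x) \\
\langle p \rangle \$X & \text{if not } \mathrm{nullable}(x)
\end{cases}
$$

according to the predicate nullable(x), which indicates whether the type τ ≠ ()
bound to x contains the empty tree.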

^3 This mark is especially useful for comparing two or more XPath expressions from the
same context.



INRIA 






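The function tra(a) compiles attribute expressions associated with element
definitions as follows:

$$
\begin{array}{rcl}
\mathrm{tra}(()) & = & \mathrm{notothers}(()) \\
\mathrm{tra}(\mathit{list} \mid a) & = & \mathrm{tra}(\mathit{list}) \;\&\; \mathrm{notothers}(\mathit{list}) \\
\mathrm{tra}(\mathit{list}, \mathit{list}') & = & \mathrm{tra}(\mathit{list}) \;\&\; \mathrm{tra}(\mathit{list}') \\
\mathrm{tra}(l?) & = & l \mid {\sim}l \\
\mathrm{tra}(l) & = & l \\
\mathrm{tra}(\neg l) & = & {\sim}l
\end{array}
$$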

In usual schemas (e.g. DTDs, XML Schemas), when no attribute is specified
for a given element, it simply means no attribute is allowed for the defined
element. This convention must be explicitly stated in the logic. This is the
role of the function notothers(list), which returns the negated disjunction of
all attributes not present in list. As a result, taking attributes into account
comes at an extra cost. The above translation appends a (potentially very large)
formula in which all attributes occur, for each element definition. In practice, a
placeholder atomic proposition is inserted until the full set of attributes involved
in the problem formulation is known. When the whole formula has been parsed,
placeholders are replaced by the conjunction of negated attributes they denote.
This extra cost can be observed in practice, and the system allows two modes
of operation: with or without attributes^4. Nevertheless the system is still
capable of handling real-world DTDs (such as the DTD of XHTML 1.0 Strict)
with attributes. This is due to (1) the limited expressive power of languages
such as DTD that do not allow for disjunction over attribute expressions (like
"list | a"); and, more importantly, (2) the satisfiability-testing algorithm, which
is implemented using symbolic techniques [10].
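
The role of notothers can be illustrated by a small Python sketch that prints the formula obtained for an attribute expression once the full set of attribute names is known. Several choices here are assumptions made for the illustration and are not taken from the system: attributes are rendered as `<l>T` following Figure 6, a disjunction of attribute lists is rendered as the disjunction of the translations of its alternatives, parentheses are added liberally, and all function names are hypothetical.

```python
def attr(name):
    """Render the proposition 'attribute named name' as in Figure 6."""
    return f"<{name}>T"


def notothers(names, universe):
    """Negated disjunction of all attributes of the universe not listed in names."""
    others = sorted(set(universe) - set(names))
    if not others:
        return "T"  # nothing else to forbid
    return "~(" + " | ".join(attr(n) for n in others) + ")"


def tra_item(name, kind):
    """Translate one optional, required or prohibited attribute."""
    if kind == "required":
        return attr(name)
    if kind == "prohibited":
        return "~" + attr(name)
    if kind == "optional":
        return f"({attr(name)} | ~{attr(name)})"
    raise ValueError(f"unknown kind: {kind}")


def tra_list(items, universe):
    """Translate an attribute list and forbid every attribute it does not mention."""
    parts = [tra_item(name, kind) for name, kind in items]
    parts.append(notothers([name for name, _ in items], universe))
    return " & ".join(parts)


def tra(alternatives, universe):
    """Assumption: alternatives of a disjunction are translated independently."""
    return " | ".join("(" + tra_list(alt, universe) + ")" for alt in alternatives)


# Attribute names found in the whole (hypothetical) problem formulation.
universe = {"id", "class", "href"}

# (id, class?) | ()
formula = tra([[("id", "required"), ("class", "optional")], []], universe)
```

For this input, `formula` is `(<id>T & (<class>T | ~<class>T) & ~(<href>T)) | (~(<class>T | <href>T | <id>T))`. The size of each notothers part grows with the number of attribute names in the problem, which is the extra cost discussed above.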

Tree type expressions form the common internal representation for a variety
of XML schema definition languages. In practice, the logical translation of a tree
type expression τ is obtained directly from a variety of formalisms for defining
schemas, including DTD, XML Schema, and Relax NG. For this purpose, the
syntax of logical formulas is extended with a predicate type("·", ·). The logical
translation of an existing schema is returned by type("f", l), where f is a file
path to the schema file and l is the element name to be considered as the entry
point (root) of the given schema. Any occurrence of this predicate will parse the
given schema, extract its internal tree type representation τ, compile it into the
logic and return the logical formula $\mathrm{tr}(\tau)^{\mathtt{T}}_{\mathtt{F}}$.

3.5 Type Tagging 

A tag (or "color") is introduced in the compilation of schemas with the purpose
of marking all node types of a specific schema. A tag is simply a fresh atomic
proposition passed as a parameter to the translation of a tree type expression.
For example, $\mathrm{tr}(\tau)^{\mathtt{xhtml}}_{\mathtt{F}}$ is the logical translation of τ where each element
definition is annotated with the atomic proposition "xhtml". With the help of tags,
it becomes possible to refer to the element types in any context. For instance,
^4 The optional argument "-attributes" must be supplied for attributes to be considered.
one may formulate $\mathrm{tr}(\tau)^{\mathtt{xhtml}}_{\mathtt{F}} \mid \mathrm{tr}(\tau')^{p'}_{\mathtt{F}}$, where $p'$ is a second tag distinct
from xhtml, for denoting the union of all τ and τ' documents, while keeping a way
to distinguish element types, even if some element names are shared by the two
type expressions.

Tagging becomes even more useful for characterizing evolutions between
successive versions of a single schema. In this setting, we need a way to distinguish
nodes allowed by a newer schema version from nodes allowed by an older
version. This distinction must not be based only on element names, but also on
content models. Assume for instance that τ' is a newer version of schema τ. If
we are interested in the set of trees allowed by τ' but not allowed by τ then we
may formulate:

$$
\mathrm{tr}(\tau')^{\mathtt{T}}_{\mathtt{F}} \;\&\; {\sim}\mathrm{tr}(\tau)^{\mathtt{T}}_{\mathtt{F}}
$$

If we now want to check more fine-grained properties, we may rather be
interested in the following (tagged) formulation:

$$
\mathrm{tr}(\tau')^{\mathtt{\_all}}_{\mathtt{F}} \;\&\; {\sim}\mathrm{tr}(\tau)^{\mathtt{T}}_{\mathtt{\_old\_complement}}
$$

In this manner, we can distinguish elements that were added in τ' and whose
names did not occur in τ, from elements whose names already occurred in τ
but whose content model changed in τ', for instance. In practice, a type is
tagged using the predicate type("f", l, φ, φ'), which parses the specified schema,
converts it into its logical representation τ and returns the formula
$\mathrm{tr}(\tau)^{\varphi}_{\varphi'}$. This kind of type tagging is useful for studying the consequences
of schema updates over queries, as presented in the next sections.



4 Analysis Predicates 

This section introduces the basic analysis tasks offered to XML application
designers for assessing the impact of schema evolutions. In particular, we propose
a means of identifying the precise reasons for type mismatches or changes in
query results under type constraints.

For this purpose, we build on our query and type expression compilers, and 
define additional predicates that facilitate the formulation of decision problems 
at a higher level of abstraction. Specifically, these predicates are introduced 
as logical macros with the goal of allowing system usage while focusing (only) 
on the XML-side properties, and keeping underlying logical issues transparent 
for the user. Ultimately, we regard the set of basic logical formulas (such as 
modalities and recursive binders) as an assembly language, to which predicates 
are translated. 

We illustrate this principle with two simple predicates designed for checking 
backward-compatibility of schemas, and query satisfiability in the presence of a 
schema. 

• The predicate backward_incompatible(τ, τ') takes two type expressions
as parameters, and assumes τ' is an altered version of τ. This predicate is
unsatisfiable iff all instances of τ' are also valid against τ. Any occurrence
of this predicate in the input formula will automatically be compiled as
$\mathrm{tr}(\tau')^{\mathtt{T}}_{\mathtt{F}} \;\&\; {\sim}\mathrm{tr}(\tau)^{\mathtt{T}}_{\mathtt{F}}$.

• The predicate non_empty("query", τ) takes an XPath expression (with the
syntax defined in Figure 2) and a type expression as parameters, and is
unsatisfiable iff the query always returns an empty set of nodes when
evaluated on an XML document valid against τ. This predicate compiles
into select("query", $\mathrm{tr}(\tau)^{\mathtt{T}}_{\mathtt{F}}$ & #), where the predicate select("query", φ)
compiles the XPath expression query into the logic, starting from a context
that satisfies φ, as explained in Section 3.3. This can be used to check
whether the modification of the schema does not contradict any part of
the query.
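
Since these predicates are logical macros, their expansion can be pictured as plain text rewriting. The sketch below builds the expanded formulas as strings from the type and select predicates of Section 3. It is an illustration only: the function names and the schema file names `schema-v1.dtd` and `schema-v2.dtd` are hypothetical, the root name is quoted as in the example of Figure 1, and the actual system performs this expansion internally on logical formulas rather than on strings.

```python
def type_formula(schema_file, root):
    """Text of the type predicate for a schema file and a root element name."""
    return f'type("{schema_file}", "{root}")'


def backward_incompatible(old_type, new_type):
    """Trees accepted by the new schema version but rejected by the old one."""
    return f"{new_type} & ~{old_type}"


def non_empty(query, schema_type):
    """Nodes selected by the query from a start context (#) that is a valid root."""
    return f'select("{query}", {schema_type} & #)'


# Hypothetical file names standing for two successive versions of a schema.
old = type_formula("schema-v1.dtd", "html")
new = type_formula("schema-v2.dtd", "html")

problem_1 = backward_incompatible(old, new)
problem_2 = non_empty("a//b[ancestor::e]", new)
```

If the formula in `problem_1` is unsatisfiable, every document valid against the new version is also valid against the old one; if the formula in `problem_2` is unsatisfiable, the query never selects anything in a document valid against the new version.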

Notice that the predicate non_empty("query", τ) can be used for checking
whether a query that is valid^5 against a schema remains valid with an updated
version of a schema. In other terms, this predicate allows determining whether a
query that must always return a non-empty result (whatever the tree on which it
is evaluated) keeps verifying the same property with a new version of a schema.

A second, more elaborate, class of predicates allows formulating problems
that combine both a query query and two type expressions τ, τ' (where τ' is
assumed to be an evolved version of τ):

• new_element_name("query", τ, τ') is satisfied iff the query query selects 
elements whose names did not occur at all in τ. This is especially useful 
for queries whose last navigation step contains a "*" node test and may 
thus select unexpected elements. This predicate is compiled into: 

~element(τ) & select("query", tr(τ')) 

where element(τ) is another predicate that builds the disjunction of all 
element names occurring in τ. In a similar manner, the predicate attribute(φ) 
builds the logical disjunction of all attribute names used in φ. 

• new_region("query", τ, τ') is satisfied iff the query query selects elements 
whose names already occurred in τ, but such that these nodes now occur 
in a new context in τ'. In this setting, the path from the root of the 
document to a node selected by the XPath expression query contains a 
node whose type is defined in τ' but not in τ, as illustrated below: 

[Illustration: in an XML document valid against τ' but not against τ, the path 
from the root to the node selected by query contains a node in τ' \ τ.] 



¹ We say that a query is valid iff its negation is unsatisfiable. 
The predicate new_region("query", τ, τ') is logically defined as follows: 

new_region("query", τ, τ') = 
    select("query", tr(τ') & ~tr(τ)^{_old_complement}) 
    & ~added_element(τ, τ') 
    & ancestor(_old_complement) 
    & ~descendant(_old_complement) 
    & ~following(_old_complement) 
    & ~preceding(_old_complement) 



The previous definition heavily relies on the partition of tree nodes defined 
by XPath axes, as illustrated by Figure 8. The definition of 
new_region("query", τ, τ') uses an auxiliary predicate added_element(τ, τ') 
that builds the disjunction of all element names defined in τ' but not in τ 
(or in other terms, elements that were added in τ'). In a similar manner, the 
predicate added_attribute(φ, φ') builds the disjunction of all attribute names 
defined in φ' but not in φ. 




Figure 8: XPath axes: partition of tree nodes. 

The predicate new_region("query", τ, τ') is useful for checking whether a 
query selects a different set of nodes with τ' than with τ because selected 
elements may occur in new regions of the document due to changes brought 
by τ'. 

• new_content("query", τ, τ') is satisfied iff the query query selects elements 
whose names were already defined in τ, but whose content model has 
changed due to evolutions brought by τ', as illustrated below: 
[Illustration: in an XML document valid against τ' but not against τ, the node 
selected by query has changed (new content model).] 



The definition of new_content("query", τ, τ') follows: 

new_content("query", τ, τ') = 
    select("query", ...) 
    & ~added_element(τ, τ') 
    & ~ancestor(added_element(τ, τ')) 
    & descendant(_old_complement) 
    & ~following(_old_complement) 
    & ~preceding(_old_complement) 



The predicate new_content("query", τ, τ') can be used for ensuring that 
XPath expressions will not return nodes with a possibly new content model 
that may cause problems. For instance, this allows checking whether an 
XPath expression whose resulting node set is converted to a string value 
(as in, e.g., XPath expressions used in XSLT "value-of" instructions) is 
affected by the changes from τ to τ'. 
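The definitions of new_region and new_content rely on the fact that, for any context node, the XPath axes self, ancestor, descendant, preceding, and following partition the element nodes of a document. The sketch below checks this property on a small tree with Python's standard xml.etree.ElementTree module. The sample document and the helper name `axes` are illustrative assumptions, not part of the system.

```python
import xml.etree.ElementTree as ET

# Illustrative sample document (an assumption, not taken from a real schema).
DOC = "<math><declare><apply><eq/></apply><condition/></declare><ci/></math>"


def axes(root, node):
    """Return the element sets of five XPath axes for `node`, as lists."""
    order = list(root.iter())                          # document order
    parent = {child: p for p in order for child in p}  # child -> parent map

    ancestors = []
    cur = node
    while cur in parent:
        cur = parent[cur]
        ancestors.append(cur)

    descendants = [n for n in node.iter() if n is not node]

    idx = order.index(node)
    ancestor_ids = {id(n) for n in ancestors}
    descendant_ids = {id(n) for n in descendants}
    # preceding: nodes before `node` in document order, ancestors excluded
    preceding = [n for n in order[:idx] if id(n) not in ancestor_ids]
    # following: nodes after `node` in document order, descendants excluded
    following = [n for n in order[idx + 1:] if id(n) not in descendant_ids]

    return {
        "self": [node],
        "ancestor": ancestors,
        "descendant": descendants,
        "preceding": preceding,
        "following": following,
    }


root = ET.fromstring(DOC)
parts = axes(root, root.find(".//apply"))
all_ids = [id(n) for group in parts.values() for n in group]
is_partition = (len(all_ids) == len(set(all_ids)) == len(list(root.iter())))
```

Because the five sets are disjoint and cover the whole tree, constraining four of them (as the conjunctions on ancestor, descendant, following, and preceding do) pins down where an `_old_complement` node may appear relative to the selected node.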

The previously defined predicates can be used to help the programmer 
identify precisely how type constraint evolutions affect queries. They can 
also be combined with usual logical connectives to formulate even more 
sophisticated problems. For example, let us define the predicate exclude(φ), 
which is satisfiable iff there is no node that satisfies φ in the whole tree. 
This predicate can be used for excluding specific element names or even nodes 
selected by a given XPath expression. It is defined as follows: 

exclude(φ) = 



This predicate can also be used for checking properties in an iterative manner, 
refining the property to be tested at each step. It can also be used for verifying 
fine-grained properties. For instance, one may check whether τ' defines the 
same set of trees as τ modulo new element names that were added in τ' with 
the following formulation: 

backward_incompatible(τ, τ') & exclude(added_element(τ, τ')) 



This allows identifying that, during the type evolution from r to r', the query 
results change has not been caused by the type extension but by new composi- 
tions of nodes from the older type. 
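The interplay between exclude and added_element can be pictured with plain set operations. The sketch below is a toy model under stated assumptions: schemas are hypothetical maps from element names to allowed child names, documents are parsed with xml.etree.ElementTree, and the Python functions only mimic the meaning of the predicates of the same name; they are not the logical formulas the system generates.

```python
import xml.etree.ElementTree as ET

# Hypothetical toy schemas: element name -> set of allowed child names.
OLD = {"doc": {"sec"}, "sec": {"p"}, "p": set()}
NEW = {"doc": {"sec", "note"}, "sec": {"p", "sec"}, "p": set(), "note": set()}


def element(schema):
    """All element names defined in a schema."""
    return set(schema)


def added_element(old, new):
    """Element names defined in the new schema but not in the old one."""
    return element(new) - element(old)


def exclude(doc, names):
    """True iff no node of the document carries one of the given names."""
    return all(node.tag not in names for node in doc.iter())


def valid(doc, schema):
    """Every node is declared and only has children allowed by the schema."""
    return all(
        node.tag in schema and all(c.tag in schema[node.tag] for c in node)
        for node in doc.iter()
    )


added = added_element(OLD, NEW)

# Incompatible because it uses a new element name.
with_new = ET.fromstring("<doc><note/></doc>")
# Incompatible although it only composes old element names in a new way.
old_only = ET.fromstring("<doc><sec><sec><p/></sec></sec></doc>")
```

In this toy setting, `old_only` is the interesting witness: it is valid against the new schema only, yet it satisfies `exclude(old_only, added)`, so the incompatibility stems from a new composition of older elements rather than from the type extension.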
In practice, instead of taking internal tree type representations (as defined 
in Section 2.2) as parameters, most predicates do actually take any logical 
formula as parameter, or even schema paths as parameters. We believe this 
facilitates the use of predicates and, most notably, how they can be composed 
together. Figure 9 gives the syntax of built-in predicates as they are 
implemented in the system, where f is a file path to a DTD (.dtd), XML Schema 
(.xsd), or Relax NG (.rng). In addition to the aforementioned predicates, the 
predicate descendant(φ) forces the existence of a node satisfying φ in the 
subtree, and predicate-name(⟨φ⟩⊕) is a call to a custom predicate, as 
explained in the next section. 

predicate ::= 
    select("query") 
    select("query", φ) 
    exists("query") 
    exists("query", φ) 
    type("f", l) 
    type("f", l, φ, φ') 
    forward_incompatible(φ, φ') 
    backward_incompatible(φ, φ') 
    element(φ) 
    attribute(φ) 
    descendant(φ) 
    exclude(φ) 
    added_element(φ, φ') 
    added_attribute(φ, φ') 
    non_empty("query", φ) 
    new_element_name("query", "f", "f'", l) 
    new_region("query", "f", "f'", l) 
    new_content("query", "f", "f'", l) 
    predicate-name(⟨φ⟩⊕) 

Figure 9: Syntax of Predicates for XML Reasoning. 



4.1 Custom Predicates 

Following the spirit of predicates presented in the previous section, users may 
also define their own custom predicates. The full syntax of XML logical 
specifications to be used with the system is defined on Figure 10, where the 
meta-syntax ⟨X⟩⊕ means one or more occurrences of X separated by commas. A 
global problem specification can be any formula (as defined on Figure 6), or a 
list of custom predicate definitions separated by semicolons and followed by a 
formula. A custom predicate may have parameters that are instantiated with 
actual formulas when the custom predicate is called (as shown on Figure 9). 
A formula bound to a custom predicate may include calls to other predicates, 
but not to the currently defined predicate (recursive definitions must be made 
through the let binder shown on Figure 6). 

Schema                 Variables   Elements   Attributes 
XHTML 1.0 basic DTD    71          52         57 
XHTML 1.1 basic DTD    89          67         83 
MathML 1.01 DTD        137         127        72 
MathML 2.0 DTD         194         181        97 

Table 2: Sizes of (Some) Considered Schemas. 

spec ::= 
    φ                              formula (see Fig. 6) 
    def; φ 

def ::= 
    predicate-name(⟨l⟩⊕) = φ'      custom definition 
    def; def                       list of definitions 

Figure 10: Global Syntax for Specifying Problems. 
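The way custom predicates behave as macros can be sketched as follows. This is a minimal illustration under our own assumptions, not the system's implementation: formulas are represented as nested Python tuples instead of being parsed from the concrete syntax, and the expander rejects any cyclic call chain, which is slightly stricter than the rule stated above (no call to the predicate currently being defined).

```python
def substitute(formula, env):
    """Replace parameter references ("var", name) by the formulas in `env`."""
    kind = formula[0]
    if kind == "var":
        return env.get(formula[1], formula)
    if kind == "atom":
        return formula
    if kind == "not":
        return ("not", substitute(formula[1], env))
    if kind in ("and", "or"):
        return (kind, substitute(formula[1], env), substitute(formula[2], env))
    if kind == "call":
        return ("call", formula[1], [substitute(a, env) for a in formula[2]])
    raise ValueError(f"unknown formula kind: {kind!r}")


def expand(formula, defs, active=()):
    """Inline every custom predicate call; `active` lists predicates in progress."""
    kind = formula[0]
    if kind in ("atom", "var"):
        return formula
    if kind == "not":
        return ("not", expand(formula[1], defs, active))
    if kind in ("and", "or"):
        return (kind, expand(formula[1], defs, active),
                expand(formula[2], defs, active))
    if kind == "call":
        name, args = formula[1], formula[2]
        if name in active:
            raise ValueError(f"recursive call to {name!r} is not allowed")
        params, body = defs[name]
        if len(params) != len(args):
            raise ValueError(f"{name!r} expects {len(params)} argument(s)")
        actual = [expand(a, defs, active) for a in args]
        bound = substitute(body, dict(zip(params, actual)))
        return expand(bound, defs, active + (name,))
    raise ValueError(f"unknown formula kind: {kind!r}")


# Hypothetical definitions: either(x, y) = x | y ; neither(x, y) = ~either(x, y)
DEFS = {
    "either": (["x", "y"], ("or", ("var", "x"), ("var", "y"))),
    "neither": (["x", "y"],
                ("not", ("call", "either", [("var", "x"), ("var", "y")]))),
    "loop": (["x"], ("call", "loop", [("var", "x")])),
}

expanded = expand(("call", "neither", [("atom", "a"), ("atom", "b")]), DEFS)
```

Expanding calls before solving keeps the underlying logic unchanged: the solver only ever sees basic formulas, which matches the view of predicates as macros over an assembly language of logical constructs.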



5 Framework in Action 

We have implemented the whole software architecture described in Section 2 
and illustrated on Figure 1 [8]. We have carried out extensive experiments of 
the system with real-world schemas such as XHTML, MathML, SVG, and SMIL 
(Table 2 gives details related to their respective sizes) and queries found in 
transformations such as the MathML content to presentation conversion [TS]. 
We present two of them that show how the tool can be used to analyze different 
situations where schemas and queries evolve. 

Evolution of XHTML Basic 

The first test consists in analyzing the relationship (forward and backward com- 
patibility) between XHTML basic 1.0 and XHTML basic 1.1 schemas. In par- 
ticular, backward compatibility can be checked by the following command: 

backward_incompatible("xhtml-basic10.dtd", 
                      "xhtml-basic11.dtd", "html") 

The test immediately yields a counter example as the new schema contains 
new element names. The counter example (shown below) contains a style 
element occurring as a child of head, which is not permitted in XHTML basic 
1.0: 

<html> 
  <head> 
    <title/> 
    <style type="_otherV"/> 
  </head> 
  <body/> 
</html> 



The next step consists in focusing on the relationship between both schemas 
excluding these new elements. This can be formulated by the following com- 
mand: 

backward_incompatible("xhtml-basic10.dtd", 
                      "xhtml-basic11.dtd", "html") 
& exclude(added_element( 
      type("xhtml-basic10.dtd", "html"), 
      type("xhtml-basic11.dtd", "html"))) 

The result of the test shows a counter example document that proves that 
XHTML basic 1.1 is not backward compatible with XHTML basic 1.0 even if 
new elements are not considered. In particular, the content model of the label 
element cannot have an a element in XHTML basic 1.0 while it can in XHTML 
basic 1.1. The counter example produced by the solver is shown below: 

<html> 
  <head> 
    <object> 
      <label> 
        <a> 
          <img/> 
        </a> 
        <img/> 
      </label> 
      <param/> 
    </object> 
    <meta/> 
    <title/> 
    <base/> 
  </head> 
  <body/> 
</html> 

XHTML basic 1.0 validity error: element "a" is not declared 
in "label" list of possible children 



Notice that we observed similar forward and backward compatibility issues with 
several other W3C normative schemas (in particular for the different versions 
of SMIL and SVG). Such backward incompatibilities suggest that applications 
cannot simply ignore new elements from newer schemas, as the combination of 
older elements may evolve significantly from one version to another. 

MathML Content to Presentation Conversion 

MathML is an XML format for describing mathematical notations and capturing 
both its structure and graphical structure, also known as Content MathML and 
Presentation MathML respectively. The structure of a given equation is kept 
separate from the presentation and the rendering part can be generated from 
the structure description. This operation is usually carried out using an XSLT 
transformation that achieves the conversion. In this test series, we focus on the 
analysis of the queries contained in such a transformation sheet and evaluate 
the impact of the schema change from MathML 1.0 to MathML 2.0 on these 
queries. 
Most of the queries contained in the transformation represent only a few 
patterns that are very similar up to element names. The following three 
patterns are the most frequently used: 

Q1: //apply[*[1][self::eq]] 

Q2: //apply[*[1][self::apply]/inverse] 

Q3: //sin[preceding-sibling::*[position()=last() 
          and (self::compose or self::inverse)]] 

The first test is formulated by the following command: 

new_region("Q1", "mathml.dtd", "mathml2.dtd", "math") 

The result of the test shows a counter example document that proves that the 
query may select nodes in new contexts in MathML 2.0 compared to MathML 
1.0. In particular, the query Q1 selects apply elements whose ancestors can be 
declare elements, as indicated on the document produced by the solver: 

<math xmlns:solver="http://wam.inrialpes.fr/xml" 
      solver:context="true"> 
  <declare> 
    <apply solver:target="true"> 
      <eq/> 
    </apply> 
    <condition/> 
  </declare> 
</math> 



Notice that the solver automatically annotates a pair of nodes related by the 
query: when the query is evaluated from a node marked with the attribute 
solver:context, the node marked with solver:target is selected. To evaluate 
the effect of this change, the counter example is filled with content and passed 
as an input parameter to the transformation. This immediately reveals a bug in 
the transformation, as the resulting document is not a MathML 2.0 presentation 
document. Based on this analysis, we know that the XSLT template associated 
with the match pattern Q1 must be updated to cope with MathML evolution 
from version 1.0 to version 2.0. 
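The counter example can be replayed outside the solver to confirm that Q1 selects the annotated node. Since Python's xml.etree.ElementTree supports only a subset of XPath, the sketch below hand-codes the meaning of Q1 (an apply element whose first child element is eq) rather than evaluating the expression itself. The helper names `q1` and `ancestors` are ours, and the namespace URI is simply the one carried by the document.

```python
import xml.etree.ElementTree as ET

SOLVER_NS = "http://wam.inrialpes.fr/xml"

COUNTER_EXAMPLE = f"""
<math xmlns:solver="{SOLVER_NS}" solver:context="true">
  <declare>
    <apply solver:target="true">
      <eq/>
    </apply>
    <condition/>
  </declare>
</math>
"""


def q1(root):
    """Hand-written equivalent of //apply[*[1][self::eq]]."""
    return [n for n in root.iter("apply") if len(n) > 0 and n[0].tag == "eq"]


def ancestors(root, node):
    """Tag names of the ancestors of `node`, nearest first."""
    parent = {child: p for p in root.iter() for child in p}
    names = []
    while node in parent:
        node = parent[node]
        names.append(node.tag)
    return names


root = ET.fromstring(COUNTER_EXAMPLE)
selected = q1(root)
context_names = ancestors(root, selected[0])
```

The selected node is the one the solver marked with solver:target, and its ancestor list contains declare, which is the new context reported by the test.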

The next test consists in evaluating the impact of the MathML type evolution 
for the query Q2 while excluding all new elements added in MathML 2.0 from 
the test. This identifies whether old elements of MathML 1.0 can be composed 
in MathML 2.0 in a different manner. This can be performed with the following 
command: 

new_content("Q2", "mathml.dtd", "mathml2.dtd", "math") 
& exclude(added_element(type("mathml.dtd", "math"), 
                        type("mathml2.dtd", "math"))) 

The test result shows an example document that effectively combines MathML 
1.0 elements in a way that was not allowed in MathML 1.0 but is permitted in 
MathML 2.0. 

<math xmlns:solver="http://wam.inrialpes.fr/xml" 
      solver:context="true"> 
  <apply solver:target="true"> 
    <apply> 
      <inverse/> 
    </apply> 
    <annotation-xml> 
      <math/> 
    </annotation-xml> 
    <condition/> 
  </apply> 
</math> 



Similarly, the last test consists in evaluating the impact of the MathML type 
evolution for the query Q3, excluding all new elements added in MathML 2.0 
and counter example documents containing declare elements (to avoid trivial 
counter examples): 

new_region("Q3", "mathml.dtd", "mathml2.dtd", "math") 
& exclude(added_element(type("mathml.dtd", "math"), 
                        type("mathml2.dtd", "math"))) 
& exclude(declare) 

The counter example document shown below illustrates a case where the sin 
element occurs in a new context. 

<math xmlns:solver="http://wam.inrialpes.fr/xml" 
      solver:context="true"> 
  <apply> 
    <annotation-xml> 
      <math> 
        <apply> 
          <inverse/> 
          <sin solver:target="true"/> 
        </apply> 
      </math> 
    </annotation-xml> 
  </apply> 
</math> 
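Q3 deserves a closer look because preceding-sibling is a reverse axis in XPath 1.0: inside its predicate, positions are counted backwards from the context node, so position()=last() designates the preceding sibling farthest from sin, that is, the first child of its parent. The sketch below hand-codes this reading with xml.etree.ElementTree (which cannot evaluate Q3 directly). The function name `q3` and the second sample document are illustrative assumptions, and solver annotations are omitted for brevity.

```python
import xml.etree.ElementTree as ET


def q3(root):
    """Hand-written equivalent of
    //sin[preceding-sibling::*[position()=last()
          and (self::compose or self::inverse)]]."""
    parent = {child: p for p in root.iter() for child in p}
    selected = []
    for node in root.iter("sin"):
        if node not in parent:          # the document element has no siblings
            continue
        siblings = list(parent[node])
        preceding = siblings[:siblings.index(node)]
        # On a reverse axis, last() is the sibling farthest from `node`,
        # i.e. the first preceding sibling in document order.
        if preceding and preceding[0].tag in ("compose", "inverse"):
            selected.append(node)
    return selected


# Structure of the counter example shown above.
COUNTER_EXAMPLE = ET.fromstring(
    "<math><apply><annotation-xml><math><apply>"
    "<inverse/><sin/>"
    "</apply></math></annotation-xml></apply></math>"
)
# Hypothetical variant: the nearest preceding sibling is inverse, but the
# farthest one is ci, so Q3 does not select sin here.
VARIANT = ET.fromstring("<apply><ci/><inverse/><sin/></apply>")

hits = q3(COUNTER_EXAMPLE)
misses = q3(VARIANT)
```

On the counter example, sin is selected because inverse is the first child of the enclosing apply; in the variant, the farthest preceding sibling is ci, so nothing is selected.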



Applying the transformation on previous examples yields documents which 
are neither MathML 1.0 nor MathML 2.0 valid. As a result, the stylesheet 
cannot be used safely over documents of the new type without modifications. 
In addition, the required changes to the stylesheet are not limited to the addition 
of new templates for MathML 2.0 elements. The templates that deal with the 
composition of MathML 1.0 elements should be revised as well. 

All the previous tests were processed in less than 30 seconds on an ordinary 
laptop computer running Java under Mac OS X. 

6 Conclusion 

In this article, we present a logical framework and a tool for verifying 
forward/backward compatibility issues caused by the evolution of schemas and 
queries. The tool allows XML designers to identify queries that must be 
reformulated in order to produce the expected results across successive schema 
versions. With this tool, designers can examine precisely the impact of schema 
changes over queries, therefore facilitating their reformulation. We gave 
illustrations of how to use the tool for both schema and query evolution on 
realistic examples. In particular, we considered typical situations in 
applications involving the evolution of W3C schemas such as XHTML and MathML. 
The tool can be very useful for standard schema writers and maintainers, 
helping them enforce some level of quality assurance on compatibility between 
versions. 
There are a number of interesting extensions to the proposed system. First, 
the set of predicates can be easily enriched to detect more precisely the impact 
on queries. For example, one can extend the tagging to identify separately every 
navigation step and qualifier in a query expression. This will help greatly in the 
identification and reformulation of the navigation steps or qualifiers affected by 
schema evolution. 